Abstract



Despite the variety of protein sizes, shapes, and backbone configurations
found in nature, the design of novel protein folds remains an open problem.
Within simple lattice models it has been shown that not all structures are
equally suitable for design. Rather, certain structures are distinguished
by unusually high designability, defined as the number of amino-acid
sequences for which they represent the unique ground state; sequences
associated with such structures possess both robustness to mutation and
thermodynamic stability. Here we report that highly designable backbone
conformations also emerge in a realistic off-lattice model. The highly
designable conformations of a chain of 23 amino acids are identified, and
found to be remarkably insensitive to model parameters. While some of these
conformations correspond closely to known natural protein folds, such as the
zinc finger and the helix-turn-helix motifs, others do not resemble known
folds and may be candidates for novel fold design.

Introduction

The de novo design of proteins, an object of enormous activity in recent
years [1-?], has so far dealt primarily with the redesign of known protein
folds. Two major accomplishments in the direction of designing a fold that
is distinct from known natural folds are the synthesis of a right-handed
coiled coil and the synthesis of a zinc finger without zinc [10-12]. To
challenge the best efforts of de novo design, nature offers roughly 1000
qualitatively distinct protein folds. Why has it proven difficult to design
new protein folds? What program should we follow to achieve ab-initio design
of novel folds?

The principle of designability offers an answer to both these questions
for simple lattice models. The designability of a structure is measured by
the number of sequences that design it, i.e. the number of sequences that
have the given structure as their unique lowest-energy conformation.
Structures can differ vastly in their designability, and it has been
demonstrated that high designability entails other protein-like properties,
such as mutational stability, thermodynamic stability [1?, 1?], and fast
folding kinetics [16, ?]. Design is hard in the sense that most structures
have low designability and their associated sequences lack these
protein-like properties. For successful de novo design, one should first
identify the few highly designable structures.

It is an open question whether designability applies to real proteins as it
does to lattice polymers. Real protein structures have a degree of complexity
that cannot be effectively represented within a simple lattice model. For
example, on a lattice the angles between bonds differ from those naturally
adopted in real proteins. Also, whereas in a cubic lattice model the cube
minimizes surface area for a given volume and is perfectly packed, there
exists no counterpart of the perfect cube once the lattice is removed. For
designability to guide practical design of new folds, it must apply to
realistic descriptions of protein structure.

In this paper we report the computation of designability within an
off-lattice model that incorporates angles favored by natural proteins, for
protein chains of up to N = 23 amino acids. We find that the essential
qualitative features of designability survive the transition from lattice model to
off-lattice model. In particular, it remains true that a small fraction of 
compact structures are highly designable: these are nondegenerate ground 
states for an enormous number of amino-acid sequences. The vast majority 
of structures, on the other hand, are suitable ground states for few, if any, 
amino-acid sequences. Furthermore, the sequences that fold into highly des- 
ignable structures have enhanced thermodynamic stability - the energy of 
the nearest excited state is separated from the ground-state energy by an 
appreciable gap. 

Results 

Model 

Our off-lattice model is a 3-state discrete-angle model of the kind
introduced by Park and Levitt [20], supplemented by uniform spheres centered
on Cα and/or Cβ positions in order to account for excluded-volume effects.
The energy of a particular amino-acid sequence folded into a particular
backbone configuration is evaluated as the dot product of the sequence's
hydrophobicity vector with the vector of (normalized) accessible surface
areas of the amino acids in the chain [21].

Designability for a 23-mer 

The designability of a structure denotes the number of distinct
HP-sequences having that structure as their unique ground state.
Designability is an important attribute of a structure, since it quantifies
how many mutations an amino-acid sequence can sustain while still folding to
the given ground-state configuration.

The distribution of designabilities for our model, displayed in Fig. 2,
reproduces a crucial feature first observed on the lattice: While the vast 
majority of structures have very low designability, the trailing edge (or tail) 
of the distribution consists of a small number of structures of very high 
designability. Thus designability distinguishes a small subset of structures 
from generic ones. 

It turns out that the identities of these highly designable structures
depend only weakly on the values of the parameters that enter our calculation:
the surface area cutoff A_c, the clustering radius λ, the sidechain radius
r_β, the set of allowed dihedral angles, and the range of amino-acid
hydrophobicities.
More specifically, a significant fraction of structures identified as highly des- 
ignable for one set of parameter values remains highly designable when these 
parameters are varied. We provide evidence for this important observation 
in the next five subsections. 

Surface area cutoff 

As described in Methods, open structures are expected to exhibit low
designability. We anticipate that the highly designable structures of interest
to us will fall mainly within the class of compact structures, and therefore
only these compact structures are needed in our calculation. The surface
area cutoff A_c determines how compact a structure must be in order to
qualify. We expect that, provided the choice of A_c is not too restrictive,
its particular value ought not to be important.

A computationally practical choice of the surface-area cutoff eliminates 
most of the less compact configurations. A few of these might have proven 
highly designable if retained; however, our objective is not to find all highly
designable structures, but only to identify some of them. Therefore, our 
major concern is not that we might incorrectly discard a few designable 
structures, but rather that we might produce false positives: structures that 
appear to be highly designable with a restrictive value of the cutoff but have 
low designability for a more relaxed cutoff. A larger cutoff admits previously 
disallowed configurations that "steal" some sequences from a configuration 
originally identified as highly designable, thereby reducing its designability.

In practice, as shown in Fig. 3a, highly designable structures tend to
remain highly designable with increasing surface-area cutoff. For example, 
9 of the 10 most designable structures remain within the 100 most designable 
even after the surface-area cutoff is relaxed sufficiently to admit a 10-fold 
increase in the number of participating structures. 
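
The retention statistic quoted here (how many of the k most designable structures stay within the top 100 after a parameter change) can be illustrated with a short Python sketch. The function name `retained_fraction` and the toy designability tables are hypothetical and serve only to make the bookkeeping explicit; they are not taken from the calculation itself.

```python
def retained_fraction(before, after, k, top=100):
    """Fraction of the k most designable structures in `before` that are
    still among the `top` most designable structures in `after`.

    `before` and `after` map a structure label to its designability.
    """
    def ranked(table):
        # Sort by decreasing designability; break ties by label so the
        # ranking is deterministic.
        return sorted(table, key=lambda s: (-table[s], s))

    leaders = ranked(before)[:k]
    survivors = set(ranked(after)[:top])
    return sum(1 for s in leaders if s in survivors) / len(leaders)


# Toy designabilities (invented for illustration only).
before = {"s1": 90, "s2": 80, "s3": 70, "s4": 10, "s5": 5}
after = {"s1": 85, "s2": 20, "s3": 60, "s4": 30, "s5": 50, "s6": 40}
print(retained_fraction(before, after, k=3, top=3))
```

In this toy case structure `s2` loses most of its sequences to newly admitted competitors and drops out of the leading group, which is exactly the kind of false positive the relaxed cutoff is meant to expose.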

Clustering radius 

As discussed in Methods, structures whose backbones differ insignificantly
from one another ought not to be considered distinct. This observation is
embodied in our calculation by grouping into clusters those structures whose
backbone configurations lie within a certain crms distance, λ, of one
another. Varying the clustering radius λ leaves unchanged the set of
configurations that participate in the calculation. For λ < 0.1 Å, nearly
every cluster consists of a unique configuration. To exhibit the dependence
of the most designable structures on λ, we fix a configuration and follow
the designability of the cluster to which that configuration belongs as a
function of λ. As shown in Fig. 3b, the most designable structures remain
roughly the same as λ is varied over a wide range.

Sidechain radius 

Excluded volume is incorporated by means of a hard sphere of radius r_β
centered on the β-carbon of each amino acid. Increasing the sidechain radius
eliminates some configurations because of steric clashes, while decreasing
r_β admits previously ineligible configurations. Starting at r_β = 1.9 Å, we
identify the most designable structures and then count the fraction of these
structures which remain highly designable as r_β is reduced. As shown in
Fig. 3c, the identities of the most designable structures are well preserved.

Choice of angles 

Next, we address to what extent an outcome depends on a particular 
choice of the discrete set of dihedral angles. A discrete set of angles cannot 
sample the structure space fully, and so cannot "hit" all possible structures. 
On the other hand, we know that the designability of a structure depends on 
the local density of solvent-exposure vectors A [15], with highly designable
structures occupying the lowest density regions. If the subset of structures 
sampled by a discrete set of angles reasonably preserves density in the space 
of structures, highly designable structures should remain highly designable 
as we improve our sampling of structure space. 

To examine this possibility, we identify configurations generated by one
angle set and follow their cluster designabilities as configurations from
other angle sets are added. We take five different angle sets derived from
fitting to 1PSV, and use the most compact configurations generated by each
set. We calculate the designability of structures using configurations from,
respectively, one, two, three, four, and finally all five sets. We observe
in Fig. 3d that the most designable structures in set #1 remain highly
designable even as configurations from sets #2, #3, #4, and #5 are added.
This result is maintained under permutation of the five sets. Apparently,
any reasonable choice of angle set covers the structure space sufficiently
well that highly designable structures can be identified with high
probability.

HP sequences 

To check whether the identification of designable structures depends on
our use of HP (binary) sequences of amino acids, we recalculate
designabilities using amino acids with continuous real-valued
hydrophobicities. We randomly choose 4,000,000 sequences
$h = (h_1, \ldots, h_N)$, where $h_i \in [0, 1]$, and evaluate their energy
for all configurations using equation (1). In Fig. 3e we plot the
designability calculated this way against that from the enumeration of HP
sequences. As the figure shows, the highly designable structures computed by
these two alternative methods are nearly identical.
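
The sampling procedure described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the reported results: the helper names, the degeneracy tolerance `tol`, the fixed seed, and the three made-up exposure vectors for a 4-residue chain are all hypothetical.

```python
import random


def unique_ground_state(h, configs, tol=1e-12):
    """Index of the configuration with the strictly lowest energy
    E = sum_i h_i * a_i, or None if the minimum is degenerate."""
    energies = [sum(hi * ai for hi, ai in zip(h, a)) for a in configs]
    order = sorted(range(len(configs)), key=energies.__getitem__)
    if len(order) > 1 and energies[order[1]] - energies[order[0]] <= tol:
        return None
    return order[0]


def sampled_designability(configs, n_samples, seed=0):
    """Count, per configuration, how many random sequences with h_i drawn
    uniformly from [0, 1) have it as their unique ground state."""
    rng = random.Random(seed)
    n = len(configs[0])
    counts = [0] * len(configs)
    for _ in range(n_samples):
        h = [rng.random() for _ in range(n)]
        winner = unique_ground_state(h, configs)
        if winner is not None:
            counts[winner] += 1
    return counts


# Three invented normalized exposure vectors for a 4-residue chain.
configs = [
    [0.10, 0.40, 0.40, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.10, 0.10, 0.40],
]
counts = sampled_designability(configs, n_samples=10_000)
print(counts)
```

The middle vector is the average of the other two, so its energy always lies between theirs and it can never be a unique ground state. This is a small-scale picture of how a structure surrounded by competitors in exposure space ends up with low designability.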

Parameter Independence 

In the preceding five subsections we have demonstrated that the parameters
can sustain a considerable degree of variation without significantly
changing the outcome of the designability calculation. The weak dependence
of the set of highly designable structures on parameters is illustrated in
Fig. 3. Because the identity of the highly designable structures is robust
to parameter variation, we now examine their potential as candidates for
design.

Gap 

In particular, a prerequisite for design is believed to be the presence of 
a large separation between the ground-state energy and the energy of the 
lowest excited state. For each structure, we have identified the HP-sequence 
that makes this gap the largest. The value of this largest gap is shown in 
Fig. 4, as a function of the designability of the structure.

To convert the vertical scale of Fig. 4 to real energies, we observe that
one unit of energy corresponds to a sequence of exclusively hydrophobic
amino acids (h_i = 1) folded into one of our typical compact structures. Our
choice of surface area cutoff A_c guarantees that a typical compact
configuration has around half of its maximal accessible surface exposed,
about 25 Å² per residue. A conservative estimate for the energy of exposed
surface, 20 cal/Å²/mol, then yields an energy on the order of 10 kcal/mol
for a 23-mer. The highest gap energies achieved in Fig. 4, of order 0.05,
therefore correspond to a gap of 0.5 kcal/mol, around k_BT at room
temperature. This gap is roughly the energy needed to promote one
hydrophobic amino acid from core to surface.
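
The unit conversion in this paragraph is simple arithmetic, spelled out below as a Python sketch. The room temperature of 298 K and the value of the gas constant are assumptions added here for the comparison with k_BT; the other numbers are those quoted in the text. The unrounded product is 11.5 kcal/mol, which the text rounds to the order of 10 kcal/mol.

```python
# Quantities quoted in the text.
n_residues = 23
exposed_area_per_residue = 25.0   # Å^2 per residue
surface_energy = 20.0             # cal / (Å^2 mol)
gap_model_units = 0.05            # largest gaps reported, in model units

# One model energy unit: an all-hydrophobic chain in a typical compact fold.
unit_kcal = n_residues * exposed_area_per_residue * surface_energy / 1000.0
gap_kcal = gap_model_units * unit_kcal

# Thermal energy per mole at room temperature (assumed T = 298 K).
R_KCAL = 1.987204e-3              # gas constant in kcal / (mol K)
kT_kcal = R_KCAL * 298.0

# Cost of exposing one hydrophobic residue's 25 Å^2 of surface.
one_residue_kcal = exposed_area_per_residue * surface_energy / 1000.0

print(f"energy unit      ~ {unit_kcal:.1f} kcal/mol")
print(f"largest gap      ~ {gap_kcal:.2f} kcal/mol")
print(f"k_B T (298 K)    ~ {kT_kcal:.2f} kcal/mol")
print(f"one residue cost ~ {one_residue_kcal:.2f} kcal/mol")
```

All three energies (largest gap, thermal energy, single-residue exposure cost) come out between 0.5 and 0.6 kcal/mol, consistent with the order-of-magnitude statements above.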

Also plotted is the average gap for all HP-sequences which design a 
structure. It is evident that high designability correlates strongly with a 
large gap. 

Discussion 

Designability off lattice 

The principle of designability is that some protein structures are
intrinsically easier to design than others. However, up to now,
designability has been demonstrated only in highly restrictive lattice
models. Our calculations indicate that the qualitative features of
designability in lattice models are also exhibited off-lattice. Namely, a
small minority of off-lattice structures are distinguished by high
designability: these structures are lowest-energy states for many more than
their share of sequences. Moreover, the sequences associated with these
structures have enhanced thermodynamic stability. The work presented here,
using a realistic off-lattice model for protein-backbone configurations,
makes it more plausible that designability applies to real proteins.

Highly designable structures 

The insensitivity to model parameters of the results presented suggests 
that our highly designable structures are possible candidates for real protein 
design. It is therefore worthwhile to study some of our best candidates 
in detail, and to understand what architectural properties distinguish the 
most designable structures from the least designable ones, and how the most 
designable ones compare with known natural structures. 

Representative configurations of some of the most designable structures
are shown in Fig. 1a-c. A striking characteristic of the highly designable
structures is that each has a well-defined core consisting of a small subset
of the amino acids of the chain. For example, in Fig. 5 we have plotted the
inaccessible surface area of each amino acid along the chain for the
configuration appearing in Fig. 1b. Observe that 5 of the 23 amino acids are
more than 70% buried. Also shown in Fig. 5 is the probability that a
hydrophobic amino acid occupies a particular site, averaged over all
HP-sequences that design the structure, revealing the preference of
hydrophobic amino acids for the core. A quantitative measure of the core in
a structure is the variance v_s of the exposure vector A:
$v_s = (1/N) \sum_i a_i^2 - (1/N^2) (\sum_i a_i)^2$. In Fig. 6 we plot v_s
versus the designability N_s. On average the two quantities correlate well;
however, the scatter of the data is large in the region of low N_s:
structures with well-formed cores are not necessarily highly designable.
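
As a small numerical illustration of the variance measure, the Python sketch below evaluates it for two invented exposure vectors: one with a pronounced core (two nearly buried sites) and one with uniform exposure. The function name and the toy vectors are hypothetical.

```python
def exposure_variance(a):
    """Variance of an exposure vector:
    v_s = (1/N) * sum(a_i^2) - ((1/N) * sum(a_i))^2."""
    n = len(a)
    mean = sum(a) / n
    return sum(x * x for x in a) / n - mean * mean


# Invented normalized exposure vectors for a 5-residue chain.
core_like = [0.02, 0.02, 0.32, 0.32, 0.32]  # two buried sites, three exposed
uniform = [0.20] * 5                        # no distinction between sites

print(exposure_variance(core_like))
print(exposure_variance(uniform))
```

A uniform pattern gives zero variance, while segregating residues into buried and exposed sites raises it, which is why the variance serves as a proxy for how well formed the core is.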

A zinc-finger-like fold emerges from our calculation as one of the most
designable structures. The fold (Fig. 1b) does not simply replicate 1PSV
(Fig. 1d), on which we optimized our angle set. The structure of 1PSV is
too open to be designable within our model because the small, uniformly
sized sidechains cannot fill the large opening between the α-helix and the
β-β turn in 1PSV. Interestingly, the model produces a highly designable
solution by collapsing the α-helix onto the β-β turn.

Another of our most designable structures is similar to another small 
natural fold, the helix-turn-helix (see Fig. 1c).

Some of our most designable structures (e.g., that shown in Fig. 1a) do
not resemble any known natural folds. These structures are candidates for 
the design of truly novel folds. 

Targeting a fold by fitting the angle set to a chosen structure is not
essential. For example, we can obtain a suitable angle set by choosing two
pairs of dihedral angles (φ, ψ) within the β-sheet region and one pair from
the α-helix region, locally optimizing on 160 representative natural
structures from the PDB database [20]. Among the most designable structures
emerging for this angle set is the zinc-finger-like structure in Fig. 7a,
shown next to its apparent natural counterpart, 1NC8 [23] (Fig. 7b).


Conclusions 

In summary, we have computed the designabilities of structures within
an off-lattice model of realistic protein-backbone configurations. Highly
designable structures emerge with remarkable insensitivity to model
parameters. The sequences which design these structures have strongly
enhanced mutational stability and a large energy gap between the native fold
and the lowest non-native conformation. In this light, it is interesting
that recent mutation studies on some small proteins show that they maintain
their native folds even when about half of their residues are replaced by
alanine [24, 25]. Some of our highly designable structures correspond
closely to natural folds, such as the zinc-finger and helix-turn-helix
motifs. Others do not resemble existing structures, and are candidates for
ab-initio design of novel protein folds.

Methods 

Model 

The model we adopt is closely related to the off-lattice, m-state
discrete-angle model introduced by Park and Levitt [20]. Each configuration
is defined by a sequence of Cα bonds of length 3.8 Å, and each pair of
dihedral angles (φ, ψ) is restricted to one of only m alternatives; here we
take m = 3. The set of m allowed angle pairs is chosen by fitting to the
backbone coordinates of representative natural proteins [20], as discussed
below. To suppress self-intersections of the chain, we augment the model by
introducing a volume for the amino-acid residues in the form of a sphere of
radius r_β centered on Cβ (the first carbon of the sidechain). The backbones
of some configurations constructed in this fashion are shown in Fig. 1a-c.

This off-lattice model incorporates properties of real polymers not well
reproduced in simple lattice models. On the lattice, for example, allowed
ground-state structures were limited to those maximally compact structures
that fill the unique rectangle or box of minimum surface area. Off the
lattice, every structure can be expected to have a distinct surface area,
but once again, open or extended structures are not expected to be
designable. We entertain as plausible ground-state structures only those
with a surface area below some cutoff value A_c, which enters our
computation as a parameter‡.

Because a discrete angle set represents only a crude approximation to a
continuum of angles, it is unrealistic to expect the surface area of a
discrete-angle structure to faithfully reproduce the surface area of a
structure built from more flexible angles. Importantly, using flexible
angles would allow our more open structures, e.g. those just below the
cutoff A_c, to contract and reduce their exposed surface areas. To achieve
this equalizing effect of a continuum of angles within the limitations of a
discrete-angle model, we normalize the vector of solvent-accessible surface
areas $A = (a_1, \ldots, a_N)$, where a_i is the solvent-accessible surface
area of the i-th residue, in such a way as to preserve the pattern of
surface exposure along a chain. A suitable procedure§ is to normalize the
vector A for each structure by the total exposed surface area of that
structure: $\hat{A} = A / \sum_i a_i = (\hat{a}_1, \ldots, \hat{a}_N)$. This
procedure treats all structures below the cutoff A_c as equally compact,
while preserving each structure's individual pattern of surface exposure
along the chain.
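
The effect of this normalization can be seen in a short Python sketch. The function name and the two invented area vectors are hypothetical; the second vector has the same exposure pattern as the first but twice the total exposed area, mimicking a more open structure.

```python
def normalize_exposure(areas):
    """Divide each per-residue accessible area by the structure's total
    exposed area, so that the normalized exposures sum to one."""
    total = sum(areas)
    if total <= 0:
        raise ValueError("total exposed area must be positive")
    return [a / total for a in areas]


# Invented per-residue accessible areas in Å^2.
compact = [10.0, 30.0, 40.0, 20.0]     # total 100 Å^2
more_open = [20.0, 60.0, 80.0, 40.0]   # same pattern, total 200 Å^2

print(normalize_exposure(compact))
print(normalize_exposure(more_open))
```

Both structures map to the same normalized vector, so after normalization they compete on the pattern of exposure along the chain rather than on total exposed area.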

Clustering 

As with real proteins, description and comparison of configurations
off-lattice demands precision about what we mean by the term "structure."
For example, a protein structure obtained by NMR represents an ensemble of
configurations, no element of which necessarily provides a better fit to the
data than any other. This ensemble presumably reproduces the
temperature-induced fluctuations of a natural protein around its native
state. On averaging over this ensemble for small stably-folded polypeptides
in the PDB database, one finds a typical crms of roughly 0.3 to 0.5 Å per
residue. A similar range of crms can be inferred from the B values of
protein crystals [?]. Accordingly, our off-lattice polymer configurations
are grouped into clusters consisting of all configurations lying within a
crms distance λ per residue of one another. Configurations within a cluster
are to be thought of as variations of a single structure, and we refer to
clusters and structures interchangeably.

Designability 

We define the designability of a structure as the sum of the
designabilities of its included configurations. The designability of a
configuration is simply the number of sequences with that configuration as a
unique ground state [14, 15]. To evaluate the energy of a sequence on each
configuration, we associate a hydrophobicity h_i with each amino acid of
the sequence. In practice, we assign a hydrophobicity which is either 0
(Polar) or 1 (Hydrophobic) to each monomer to create an HP-sequence. That
this is a reasonable simplification finds support in the work of Hecht and
co-workers [?] (cf. Fig. 3e for the results of a more general choice). The
energy of a particular sequence folded into a particular configuration is
obtained by taking the sum of the products of each amino acid's
hydrophobicity h_i with its normalized surface exposure â_i,

$$E = \sum_i h_i \hat{a}_i. \qquad (1)$$

We numerically evaluate the energy of all HP-sequences for all
configurations.
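
The enumeration described in this subsection can be sketched as follows for a toy chain. This is a minimal illustration, not the production calculation: the function names, the degeneracy tolerance `tol`, the four invented exposure vectors for a 4-residue chain, and the grouping into clusters are all hypothetical. The energy follows equation (1), and a sequence counts toward a configuration only if that configuration is its unique ground state.

```python
from itertools import product


def energy(h, a_hat):
    """Equation (1): E = sum_i h_i * a_hat_i."""
    return sum(hi * ai for hi, ai in zip(h, a_hat))


def enumerate_designability(configs, tol=1e-12):
    """Enumerate all HP-sequences and assign each to its unique ground
    state, if it has one.

    Returns (counts, max_gap): counts[c] is the designability of
    configuration c, and max_gap[c] is the largest energy gap to the
    next-lowest configuration over the sequences that design c.
    """
    n = len(configs[0])
    counts = [0] * len(configs)
    max_gap = [0.0] * len(configs)
    for h in product((0, 1), repeat=n):
        ranked = sorted((energy(h, a), c) for c, a in enumerate(configs))
        (e0, c0), (e1, _) = ranked[0], ranked[1]
        gap = e1 - e0
        if gap > tol:  # skip sequences with a degenerate ground state
            counts[c0] += 1
            max_gap[c0] = max(max_gap[c0], gap)
    return counts, max_gap


def cluster_designability(counts, clusters):
    """Designability of a structure: sum over its member configurations."""
    return [sum(counts[c] for c in members) for members in clusters]


# Invented normalized exposure vectors for a 4-residue chain.
configs = [
    [0.05, 0.45, 0.45, 0.05],  # ends buried
    [0.45, 0.05, 0.05, 0.45],  # middle buried
    [0.25, 0.25, 0.25, 0.25],  # no core
    [0.10, 0.40, 0.40, 0.10],  # close variant of the first pattern
]
counts, max_gap = enumerate_designability(configs)
print(counts, max_gap)
print(cluster_designability(counts, clusters=[[0, 3], [1], [2]]))
```

In this toy set the two configurations with a clear core share all the uniquely folding sequences, while the coreless one designs none. The close variant of the first pattern wins no sequences itself but leaves the first configuration with a smaller maximum gap than the second, a small-scale analogue of competition between neighboring structures.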

Parameters 

Except as indicated explicitly in the text, we have chosen discrete angles
and the amino-acid radius to optimize the fit to the backbone of the
zinc-less synthetic zinc finger 1PSV [12] (Fig. 1d). We find that there are
many angle sets that fit the backbone of 1PSV almost equally well. For
example, the crms per residue between 1PSV and the structure obtained from
each of our 10 best angle sets varies from 0.844 Å to 0.913 Å. The angle set
we use for most of the calculations presented in this paper is
(φ, ψ) = (-95°, 135°) (β-region), (-75°, -25°) (α-region), and (-55°, -55°)
(α-region). We take r_β = 1.9 Å, the radius above which the amino acids fit
to the backbone of 1PSV would clash.

Footnotes



* Present address: Department of Physics, George Washington University, 
Washington, D.C. 20052, USA. 

† To whom reprint requests should be addressed. E-mail: tang@research.nj.nec.com.

‡ We evaluate the area of each sphere accessible to a probe sphere of
radius 1.4 Å, by the methods used in the program SERF [21]; the slightly
different values of surface area obtained by the different methods do not in
any way alter the outcome of the calculations.

§ We have checked that certain alternative normalizations (for example, 
normalizing by the total solvent-inaccessible surface area) do not alter the 
set of highly designable structures that emerge from our calculation. With 
no normalization, higher designability becomes closely correlated with lower 
solvent-accessible surface area. 
Figure Captions

1. (a)-(c) Backbone configurations of the 1st, 4th, and 15th most designable
23-mer structures. (d) Backbone configuration of the zinc finger 1PSV
[12], truncated to 23 amino acids.



2. Histogram of designabilities of 23-mer structures, using r_β = 1.9 Å.
The surface area cutoff A_c is such that 10,000 configurations
participate in the calculation, grouped into 4688 clusters with cluster
radius λ = 0.4 Å.

3. Sensitivity to parameter changes of the most designable structures
from Fig. 2. (a) Fraction of the 10, 20, 40, or 60 most designable
structures which remain in the 100 most designable as the surface-area
cutoff increases. The initial cutoff A_c is chosen so that only the
1000 most compact configurations participate, and A_c increases until
10,000 configurations participate. (b) Fraction of the 10, 20, 30, or 40
most designable structures which remain in the 50 most designable as
the clustering radius λ is increased. The 5000 most compact
configurations participate in the calculation and r_β = 1.9 Å. (c) Fraction
of the 10, 20, 40, or 60 most designable structures which remain in
the 100 most designable as the sidechain radius r_β is changed. We
have chosen the surface area cutoff so that 5000 structures participate
in the designability calculation for r_β = 1.9 Å. If some
configurations of the original most designable structures are not among
the 5000 most compact configurations for some smaller r_β, we
nevertheless retain them in the calculation. The clustering radius is
λ = 0.4 Å. (d) Fraction of the 10, 40, 70, or 100 most designable
structures which remain in the 100 most designable as configurations
from other angle sets are added. The values of the five angle sets
are: set #1 = (-95°, 135°), (-75°, -25°), (-55°, -55°); set #2 =
(-95°, 135°), (-85°, -55°), (-65°, -25°); set #3 = (-105°, 145°),
(-85°, -15°), (-75°, -35°); set #4 = (-105°, 145°), (-85°, -35°),
(-85°, -5°); set #5 = (-105°, 145°), (-85°, -35°), (-85°, -15°). (e)
Designability of structures obtained from 4,000,000 randomly generated
sequences of real numbers in [0,1] versus designability from
enumeration of HP-sequences. The 10,000 most compact configurations
participate in the calculation, λ = 0.4 Å, and r_β = 1.9 Å.

4. Maximum energy gap (red dots) and average energy gap (black dots) 
for the HP-sequences which design a given structure, plotted versus 
structure designability. The 10,000 most compact configurations of the 
23-mer participate in the calculation, with λ = 0.4 Å and r_β = 1.9 Å.

5. Solid bars: Inaccessible surface for residues (Cβ spheres) of the highly
designable configuration shown in Fig. 1b. Hollow bars: Probability,
averaged over all HP-sequences that design the configuration, that a
particular site along the chain is occupied by a hydrophobic amino
acid.

6. The average variance v_s of a cluster against the designability N_s of
the cluster for the 23-mer. The 5000 most compact configurations
participate in the calculation, λ = 0.4 Å, and r_β = 1.9 Å. Red line:
running average with bin size 30.
7. (a) Backbone configuration of the 11th most designable 23-mer
structure, using the untargeted angle set (see text): (φ, ψ) = (-55°, 135°),
(-126°, 145°), and (-85°, -25°), with a mean crms of 3.61 on a
representative subset of natural structures segmented into subchains of
21 amino acids. For this calculation, the amino acids are represented
by spheres of radius 1.52 Å centered on the Cα carbons only. (b)
Backbone configuration of the zinc finger 1NC8 [23], truncated to 23
amino acids.
Figure 1: Miller, et al.
Figure 2: Miller, et al.
Figure 3: Miller, et al.
Figure 4: Miller, et al.
Figure 5: Miller, et al.
Figure 6: Miller, et al.
Figure 7: Miller, et al.